Abstract

Cluster-weighted modeling (CWM) is a mixture approach for modeling the joint
probability of a response variable and a set of explanatory variables. The parameters
are estimated by means of the expectation-maximization (EM) algorithm according
to the maximum likelihood approach. We show that, under suitable hypotheses,
the maximization of the likelihood function of Gaussian cluster-weighted models
leads to the same parameter estimates as finite mixtures of regression and finite
mixtures of regression with concomitant variables. In this sense, the latter
can be considered as nested models of Gaussian cluster-weighted models.

Keywords: Cluster-weighted modeling, finite mixtures of regression, 
EM-algorithm 



1. Introduction 

Cluster-weighted modeling (CWM) is a mixture approach to modeling the
joint probability density of a response variable and a set of explanatory variables.
The original formulation, proposed by Gershenfeld (1997) under Gaussian and
linear assumptions, was developed in the context of media technology in order to
build a digital violin with traditional inputs and realistic sound (Gershenfeld et al.,
1999; Gershenfeld, 1999; Schöner, 2000; Schöner and Gershenfeld, 2001). Wedel
(2002) refers to such a model as the saturated mixture regression model. Ingrassia et al.
(2011) reformulated CWM from a statistical point of view in a wide framework; in
particular, under Gaussian assumptions (Gaussian CWM) we investigate the
relationships between CWM and both finite mixtures of regression (FMR) (DeSarbo and Cron,
1988; McLachlan and Peel, 2000; Frühwirth-Schnatter, 2005) and finite mixtures
of regression with concomitant variables (FMRC) (Dayton and Macready, 1988;
Wedel, 2002).

The parameters of cluster-weighted models are estimated through the EM
algorithm according to the maximum likelihood approach. In this paper, we show
that, under suitable hypotheses, the maximization of the likelihood function of
Gaussian CWM leads to the same parameter estimates as FMR and FMRC. In this
sense, FMR and FMRC can be considered as nested models of Gaussian CWM.

The remainder of the paper is organized as follows. In Section 2 we review
CWM as a general framework for mixture modeling. In Section 3 we analyse the
complete-data likelihood function of Gaussian CWM and derive the main steps of
the EM algorithm for parameter estimation. In Section 4 we show that, under
suitable hypotheses, the maximization of the likelihood function of Gaussian CWM
leads to the same parameter estimates as FMR and FMRC. Finally, in Section 5
we provide some conclusions and directions for further research.

2. Cluster-Weighted Modeling 

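Let $(\mathbf{X}, Y)$ be the pair of random vector $\mathbf{X}$ and random variable $Y$ defined
on $\Omega$ with joint probability distribution $p(\mathbf{x}, y)$, where $\mathbf{X}$ is the $d$-dimensional
input vector with values in some space $\mathcal{X} \subseteq \mathbb{R}^d$ and $Y$ is a response variable
having values in $\mathcal{Y} \subseteq \mathbb{R}$. Thus, $(\mathbf{x}, y) \in \mathcal{X} \times \mathcal{Y} \subseteq \mathbb{R}^{d+1}$. Suppose that $\Omega$ can
be partitioned into $G$ disjoint groups, say $\Omega_1, \ldots, \Omega_G$, that is $\Omega = \Omega_1 \cup \cdots \cup \Omega_G$.
CWM decomposes the joint probability $p(\mathbf{x}, y)$ as follows:

\[
p(\mathbf{x}, y; \boldsymbol{\theta}) = \sum_{g=1}^{G} p(y \mid \mathbf{x}, \Omega_g)\, p(\mathbf{x} \mid \Omega_g)\, \pi_g, \tag{1}
\]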

where $p(y \mid \mathbf{x}, \Omega_g)$ is the conditional density of the response variable $Y$ given the
predictor vector $\mathbf{x}$ and $\Omega_g$, $p(\mathbf{x} \mid \Omega_g)$ is the probability density of $\mathbf{x}$ given $\Omega_g$,
$\pi_g = p(\Omega_g)$ is the mixing weight of $\Omega_g$ ($\pi_g > 0$ and $\sum_{g=1}^{G} \pi_g = 1$), $g = 1, \ldots, G$,
and $\boldsymbol{\theta}$ denotes the set of all parameters of the model. Hence, the joint density
of $(\mathbf{X}, Y)$ can be viewed as a mixture of local models $p(y \mid \mathbf{x}, \Omega_g)$ weighted (in a
broader sense) on both local densities $p(\mathbf{x} \mid \Omega_g)$ and mixing weights $\pi_g$.

The posterior probability $p(\Omega_g \mid \mathbf{x}, y)$ that unit $(\mathbf{x}, y)$ comes from the $g$-th group
($g = 1, \ldots, G$) is given by:

\[
p(\Omega_g \mid \mathbf{x}, y) = \frac{p(\mathbf{x}, y, \Omega_g)}{p(\mathbf{x}, y)} = \frac{p(y \mid \mathbf{x}, \Omega_g)\, p(\mathbf{x} \mid \Omega_g)\, \pi_g}{\sum_{j=1}^{G} p(y \mid \mathbf{x}, \Omega_j)\, p(\mathbf{x} \mid \Omega_j)\, \pi_j}.
\]

In particular, the classification of each unit depends on both marginal and
conditional densities.

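In the traditional framework, local densities $p(\mathbf{x} \mid \Omega_g)$ are assumed to be
multivariate Gaussian with parameters $(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$, that is $\mathbf{X} \mid \Omega_g \sim N_d(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$, $g = 1, \ldots, G$.
Moreover, conditional densities $p(y \mid \mathbf{x}, \Omega_g)$ are modeled by Gaussian
distributions with variance $\sigma^2_{\varepsilon,g}$ around some deterministic function of $\mathbf{x}$, say
$\mu(\mathbf{x}; \boldsymbol{\beta}_g)$, $g = 1, \ldots, G$, so that the relationship between $Y$ and $\mathbf{X}$ in the $g$-th
group can be written as $Y = \mu(\mathbf{x}; \boldsymbol{\beta}_g) + \varepsilon_g$, where $\varepsilon_g \sim N(0, \sigma^2_{\varepsilon,g})$. Such a model
will be referred to as Gaussian CWM:

\[
p(\mathbf{x}, y; \boldsymbol{\theta}) = \sum_{g=1}^{G} \phi\big(y \mid \mathbf{x}; \mu(\mathbf{x}; \boldsymbol{\beta}_g), \sigma^2_{\varepsilon,g}\big)\, \phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)\, \pi_g,
\]

where $\phi(\cdot)$ and $\phi_d(\cdot)$ denote the probability densities of univariate and $d$-variate Gaussian distributions.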

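For the sake of simplicity, we consider the case of conditional densities
based on linear mappings, that is $\mu(\mathbf{x}; \boldsymbol{\beta}_g) = \mathbf{b}_g'\mathbf{x} + b_{g0}$, with $\boldsymbol{\beta}_g = (\mathbf{b}_g', b_{g0})'$,
$\mathbf{b}_g \in \mathbb{R}^d$ and $b_{g0} \in \mathbb{R}$:

\[
p(\mathbf{x}, y; \boldsymbol{\theta}) = \sum_{g=1}^{G} \phi\big(y \mid \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma^2_{\varepsilon,g}\big)\, \phi_d(\mathbf{x} \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)\, \pi_g, \tag{3}
\]

which will be referred to as linear Gaussian CWM.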
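As an illustration of the linear Gaussian CWM density, the following minimal sketch evaluates the mixture for a scalar input ($d = 1$), a simplifying assumption made here so that only the Python standard library is needed. The function names, the dictionary keys and the parameter values are hypothetical and do not come from the paper.

```python
import math

def normal_pdf(v, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def cwm_joint_density(x, y, groups):
    """Joint density p(x, y; theta) of a linear Gaussian CWM with scalar x.

    Each group is a dict with illustrative keys:
      b, b0   : slope and intercept of the local linear model
      s2_eps  : conditional variance of Y given x
      mu, s2  : mean and variance of the local density of x
      pi      : mixing weight
    """
    return sum(
        normal_pdf(y, g["b"] * x + g["b0"], g["s2_eps"])  # p(y | x, Omega_g)
        * normal_pdf(x, g["mu"], g["s2"])                 # p(x | Omega_g)
        * g["pi"]                                         # mixing weight
        for g in groups
    )

# Two groups with made-up parameter values.
groups = [
    {"b": 1.0, "b0": 0.0, "s2_eps": 0.5, "mu": -2.0, "s2": 1.0, "pi": 0.4},
    {"b": -0.5, "b0": 3.0, "s2_eps": 1.0, "mu": 2.0, "s2": 1.5, "pi": 0.6},
]
density = cwm_joint_density(0.3, 1.2, groups)
```

Each summand is the product of a local regression density, a local density of the input and a mixing weight, which is the sense in which the local models are weighted on both quantities.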



3. The likelihood function of Gaussian CWM 

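Let $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ be a sample of $N$ independent observation pairs
drawn from model (3). Then, the corresponding likelihood function is given
by:

\[
L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) = \prod_{n=1}^{N} p(\mathbf{x}_n, y_n; \boldsymbol{\theta}) = \prod_{n=1}^{N} \left[ \sum_{g=1}^{G} \phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_g)\, \phi_d(\mathbf{x}_n; \boldsymbol{\psi}_g)\, \pi_g \right],
\]

where $\boldsymbol{\chi}_g = (\boldsymbol{\beta}_g, \sigma^2_{\varepsilon,g})$ and $\boldsymbol{\psi}_g = (\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$. Maximization of $L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y})$ with
respect to $\boldsymbol{\theta}$ yields the maximum likelihood estimate of $\boldsymbol{\theta}$.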
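Let us consider fully categorized data:

\[
\{\mathbf{w}_n : n = 1, \ldots, N\} = \{(\mathbf{x}_n, y_n, \mathbf{z}_n) : n = 1, \ldots, N\},
\]

where $\mathbf{z}_n = (z_{n1}, \ldots, z_{nG})'$, with $z_{ng} = 1$ if $(\mathbf{x}_n, y_n)$ comes from the $g$-th
population and $z_{ng} = 0$ otherwise. Then, the complete-data likelihood function
corresponding to $\mathbf{W} = (\mathbf{w}_1, \ldots, \mathbf{w}_N)$ can be written in the form:

\[
L_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) = \prod_{n=1}^{N} \prod_{g=1}^{G} \left[\phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_g)\right]^{z_{ng}} \left[\phi_d(\mathbf{x}_n; \boldsymbol{\psi}_g)\right]^{z_{ng}} \pi_g^{z_{ng}}.
\]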

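Taking the logarithm of the complete-data likelihood, after some algebra we get:

\[
\begin{aligned}
\mathcal{L}_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) &= \ln L_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) \\
&= \sum_{n=1}^{N} \sum_{g=1}^{G} \left[ z_{ng} \ln \phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_g) + z_{ng} \ln \phi_d(\mathbf{x}_n; \boldsymbol{\psi}_g) + z_{ng} \ln \pi_g \right] \\
&= \mathcal{L}_{1c}(\boldsymbol{\chi}) + \mathcal{L}_{2c}(\boldsymbol{\psi}) + \mathcal{L}_{3c}(\boldsymbol{\pi}),
\end{aligned}
\tag{5}
\]

where

\[
\mathcal{L}_{1c}(\boldsymbol{\chi}) = \frac{1}{2} \sum_{n=1}^{N} \sum_{g=1}^{G} z_{ng} \left[ -\ln 2\pi - \ln \sigma^2_{\varepsilon,g} - \frac{\left(y_n - \mathbf{b}_g'\mathbf{x}_n - b_{g0}\right)^2}{\sigma^2_{\varepsilon,g}} \right], \tag{6}
\]

\[
\mathcal{L}_{2c}(\boldsymbol{\psi}) = \frac{1}{2} \sum_{n=1}^{N} \sum_{g=1}^{G} z_{ng} \left[ -d \ln 2\pi - \ln |\boldsymbol{\Sigma}_g| - (\mathbf{x}_n - \boldsymbol{\mu}_g)' \boldsymbol{\Sigma}_g^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_g) \right], \tag{7}
\]

\[
\mathcal{L}_{3c}(\boldsymbol{\pi}) = \sum_{n=1}^{N} \sum_{g=1}^{G} z_{ng} \ln \pi_g. \tag{8}
\]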
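The three-term decomposition of the complete-data log-likelihood can be checked numerically. The sketch below assumes a scalar input ($d = 1$) and known group labels; the function names, dictionary keys and data values are illustrative assumptions, not part of the original paper.

```python
import math

def log_normal_pdf(v, mean, var):
    """Log of the univariate Gaussian density with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi) + math.log(var) + (v - mean) ** 2 / var)

def complete_data_loglik_terms(xs, ys, z, groups):
    """Return the regression, marginal and mixing-weight terms of the
    complete-data log-likelihood for scalar x and indicator labels z[n][g]."""
    l1 = l2 = l3 = 0.0
    for x, y, z_n in zip(xs, ys, z):
        for z_ng, g in zip(z_n, groups):
            l1 += z_ng * log_normal_pdf(y, g["b"] * x + g["b0"], g["s2_eps"])
            l2 += z_ng * log_normal_pdf(x, g["mu"], g["s2"])
            l3 += z_ng * math.log(g["pi"])
    return l1, l2, l3

# Made-up parameters and a tiny labelled sample.
groups = [
    {"b": 1.0, "b0": 0.0, "s2_eps": 0.5, "mu": -2.0, "s2": 1.0, "pi": 0.4},
    {"b": -0.5, "b0": 3.0, "s2_eps": 1.0, "mu": 2.0, "s2": 1.5, "pi": 0.6},
]
xs = [-1.0, 2.0]
ys = [0.5, 1.0]
z = [[1, 0], [0, 1]]  # unit 1 belongs to group 1, unit 2 to group 2
l1, l2, l3 = complete_data_loglik_terms(xs, ys, z, groups)
total = l1 + l2 + l3
```

Because each term involves a disjoint block of parameters, the three terms can be maximized separately once the labels (or their expectations) are fixed.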



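Log-likelihood function (5) can be maximized through the EM algorithm in
order to obtain estimates of the parameters $\boldsymbol{\theta} = \{\boldsymbol{\chi}, \boldsymbol{\psi}, \boldsymbol{\pi}\}$. The E-step on the $(k+1)$-th
iteration of the EM algorithm requires the calculation of the conditional
expectation of the complete-data log-likelihood function $\mathcal{L}_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y})$ in (5), say
$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(k)})$, evaluated using the current fit $\boldsymbol{\theta}^{(k)}$ for $\boldsymbol{\theta}$. Since $\mathcal{L}_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y})$ is
linear in the unobservable data $z_{ng}$, this means calculating the current conditional
expectation of $Z_{ng}$ given $\mathbf{X}$ and $\mathbf{y}$, where $Z_{ng}$ is the random variable
corresponding to $z_{ng}$, that is

\[
\begin{aligned}
Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(k)}) &= E_{\boldsymbol{\theta}^{(k)}}\{\mathcal{L}_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y})\} \\
&= \sum_{n=1}^{N} \sum_{g=1}^{G} E_{\boldsymbol{\theta}^{(k)}}\{Z_{ng} \mid \mathbf{x}_n, y_n\} \left[ Q_1(\boldsymbol{\chi}_g; \boldsymbol{\theta}^{(k)}) + Q_2(\boldsymbol{\psi}_g; \boldsymbol{\theta}^{(k)}) + \ln \pi_g \right] \\
&= \sum_{n=1}^{N} \sum_{g=1}^{G} \tau_{ng}^{(k)} \left[ Q_1(\boldsymbol{\chi}_g; \boldsymbol{\theta}^{(k)}) + Q_2(\boldsymbol{\psi}_g; \boldsymbol{\theta}^{(k)}) + \ln \pi_g \right],
\end{aligned}
\]

where

\[
\tau_{ng}^{(k)} = \frac{\pi_g^{(k)}\, \phi\big(y_n \mid \mathbf{x}_n; \boldsymbol{\beta}_g^{(k)}, \sigma_{\varepsilon,g}^{2(k)}\big)\, \phi_d\big(\mathbf{x}_n; \boldsymbol{\mu}_g^{(k)}, \boldsymbol{\Sigma}_g^{(k)}\big)}{\sum_{j=1}^{G} \pi_j^{(k)}\, \phi\big(y_n \mid \mathbf{x}_n; \boldsymbol{\beta}_j^{(k)}, \sigma_{\varepsilon,j}^{2(k)}\big)\, \phi_d\big(\mathbf{x}_n; \boldsymbol{\mu}_j^{(k)}, \boldsymbol{\Sigma}_j^{(k)}\big)}
\]

provides the current value of $p(\Omega_g \mid \mathbf{x}_n, y_n)$ on the $k$-th iteration and

\[
Q_1(\boldsymbol{\chi}_g; \boldsymbol{\theta}^{(k)}) = \frac{1}{2} \left\{ -\ln 2\pi - \ln \sigma^2_{\varepsilon,g} - \frac{\left[y_n - (\mathbf{b}_g'\mathbf{x}_n + b_{g0})\right]^2}{\sigma^2_{\varepsilon,g}} \right\},
\]

\[
Q_2(\boldsymbol{\psi}_g; \boldsymbol{\theta}^{(k)}) = \frac{1}{2} \left\{ -d \ln 2\pi - \ln |\boldsymbol{\Sigma}_g| - (\mathbf{x}_n - \boldsymbol{\mu}_g)' \boldsymbol{\Sigma}_g^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_g) \right\}.
\]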
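The E-step can be sketched as follows for a scalar input ($d = 1$), again a simplifying assumption so that no linear algebra library is required. The names `e_step`, `normal_pdf` and the dictionary keys are hypothetical, and the parameter values are made up for illustration.

```python
import math

def normal_pdf(v, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(xs, ys, groups):
    """Return tau[n][g]: the current posterior probability that unit n
    belongs to group g, computed from the current parameter values."""
    tau = []
    for x, y in zip(xs, ys):
        weights = [
            g["pi"]
            * normal_pdf(y, g["b"] * x + g["b0"], g["s2_eps"])
            * normal_pdf(x, g["mu"], g["s2"])
            for g in groups
        ]
        total = sum(weights)  # mixture density at (x, y)
        tau.append([w / total for w in weights])
    return tau

# Current fit (made-up values) and a small sample.
groups = [
    {"b": 1.0, "b0": 0.0, "s2_eps": 0.5, "mu": -2.0, "s2": 1.0, "pi": 0.4},
    {"b": -0.5, "b0": 3.0, "s2_eps": 1.0, "mu": 2.0, "s2": 1.5, "pi": 0.6},
]
xs = [-2.1, -1.8, 2.2, 1.9]
ys = [-2.0, -1.5, 1.8, 2.1]
tau = e_step(xs, ys, groups)
```

For data far in the tails, a log-domain implementation would be more robust against underflow; the direct form is kept here because it mirrors the formula.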



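The M-step on the $(k+1)$-th iteration of the EM algorithm requires the
maximization of the conditional expectation of the complete-data log-likelihood $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(k)})$
with respect to $\boldsymbol{\theta}$. The solutions for the mixing weights $\pi_g^{(k+1)}$ and the parameters
$(\boldsymbol{\mu}_g^{(k+1)}, \boldsymbol{\Sigma}_g^{(k+1)})$ of local densities $\phi_d(\mathbf{x}_n; \boldsymbol{\psi}_g)$, $g = 1, \ldots, G$, exist in closed form
(e.g. McLachlan and Peel, 2000), that is:

\[
\begin{aligned}
\pi_g^{(k+1)} &= \frac{1}{N} \sum_{n=1}^{N} \tau_{ng}^{(k)}, \\
\boldsymbol{\mu}_g^{(k+1)} &= \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n}{\sum_{n=1}^{N} \tau_{ng}^{(k)}}, \\
\boldsymbol{\Sigma}_g^{(k+1)} &= \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \big(\mathbf{x}_n - \boldsymbol{\mu}_g^{(k+1)}\big)\big(\mathbf{x}_n - \boldsymbol{\mu}_g^{(k+1)}\big)'}{\sum_{n=1}^{N} \tau_{ng}^{(k)}}.
\end{aligned}
\tag{9}
\]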
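The closed-form updates for the mixing weights and the parameters of the local densities amount to weighted averages. A minimal sketch for scalar input ($d = 1$) follows; the function name `m_step_marginal` and the returned dictionary keys are illustrative assumptions.

```python
def m_step_marginal(xs, tau):
    """Update mixing weight, mean and variance of the local density of a
    scalar x for every group, given posterior probabilities tau[n][g]."""
    n_obs = len(xs)
    n_groups = len(tau[0])
    updates = []
    for g in range(n_groups):
        t_sum = sum(tau[n][g] for n in range(n_obs))
        pi_g = t_sum / n_obs
        mu_g = sum(tau[n][g] * xs[n] for n in range(n_obs)) / t_sum
        # The variance uses the freshly updated mean, as in the M-step formulas.
        s2_g = sum(tau[n][g] * (xs[n] - mu_g) ** 2 for n in range(n_obs)) / t_sum
        updates.append({"pi": pi_g, "mu": mu_g, "s2": s2_g})
    return updates

# With hard (0/1) posteriors the updates reduce to per-group sample moments.
xs = [0.0, 2.0, 10.0, 12.0]
tau = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
updates = m_step_marginal(xs, tau)
```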



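The updates $\mathbf{b}_g^{(k+1)}$, $b_{g0}^{(k+1)}$ and $\sigma_{\varepsilon,g}^{2(k+1)}$ for the parameters of local densities $\phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_g)$,
$g = 1, \ldots, G$, are obtained by solving the equations:

\[
\frac{\partial E_{\boldsymbol{\theta}^{(k)}}\{\mathcal{L}_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y})\}}{\partial b_{g0}} = \sum_{n=1}^{N} \tau_{ng}^{(k)} \frac{\partial Q_1(\boldsymbol{\chi}_g; \boldsymbol{\theta}^{(k)})}{\partial b_{g0}} = \frac{1}{\sigma^2_{\varepsilon,g}} \sum_{n=1}^{N} \tau_{ng}^{(k)} \left[y_n - (\mathbf{b}_g'\mathbf{x}_n + b_{g0})\right] = 0, \tag{10}
\]

\[
\frac{\partial E_{\boldsymbol{\theta}^{(k)}}\{\mathcal{L}_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y})\}}{\partial \mathbf{b}_g'} = \sum_{n=1}^{N} \tau_{ng}^{(k)} \frac{\partial Q_1(\boldsymbol{\chi}_g; \boldsymbol{\theta}^{(k)})}{\partial \mathbf{b}_g'} = \frac{1}{\sigma^2_{\varepsilon,g}} \sum_{n=1}^{N} \tau_{ng}^{(k)} \left[y_n - (\mathbf{b}_g'\mathbf{x}_n + b_{g0})\right] \mathbf{x}_n' = \mathbf{0}', \tag{11}
\]

\[
\frac{\partial E_{\boldsymbol{\theta}^{(k)}}\{\mathcal{L}_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y})\}}{\partial \sigma^2_{\varepsilon,g}} = \sum_{n=1}^{N} \tau_{ng}^{(k)} \frac{\partial Q_1(\boldsymbol{\chi}_g; \boldsymbol{\theta}^{(k)})}{\partial \sigma^2_{\varepsilon,g}} = \frac{1}{2} \sum_{n=1}^{N} \tau_{ng}^{(k)} \left\{ -\frac{1}{\sigma^2_{\varepsilon,g}} + \frac{\left[y_n - (\mathbf{b}_g'\mathbf{x}_n + b_{g0})\right]^2}{\sigma^4_{\varepsilon,g}} \right\} = 0, \tag{12}
\]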



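yielding

\[
b_{g0}^{(k+1)} = \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} y_n}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} - \mathbf{b}_g^{(k+1)\prime}\, \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n}{\sum_{n=1}^{N} \tau_{ng}^{(k)}},
\]

\[
\mathbf{b}_g^{(k+1)\prime} = \left( \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} y_n \mathbf{x}_n'}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} - \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} y_n}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} \cdot \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n'}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} \right) \left( \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n \mathbf{x}_n'}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} - \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} \cdot \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n'}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} \right)^{-1},
\]

\[
\sigma_{\varepsilon,g}^{2(k+1)} = \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \left[ y_n - \big(\mathbf{b}_g^{(k+1)\prime}\mathbf{x}_n + b_{g0}^{(k+1)}\big) \right]^2}{\sum_{n=1}^{N} \tau_{ng}^{(k)}}.
\]

See Appendix for computational details.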
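For a scalar input ($d = 1$) the regression updates reduce to a weighted least-squares fit, with the slope given by a weighted covariance divided by a weighted variance. The sketch below is written under that assumption; the function name `m_step_regression` and the toy data are illustrative only.

```python
def m_step_regression(xs, ys, tau_g):
    """Weighted least-squares update of slope, intercept and error variance
    for one group, given its posterior probabilities tau_g[n] (scalar x)."""
    t_sum = sum(tau_g)
    x_bar = sum(t * x for t, x in zip(tau_g, xs)) / t_sum
    y_bar = sum(t * y for t, y in zip(tau_g, ys)) / t_sum
    # Weighted covariance of (x, y) and weighted variance of x.
    s_xy = sum(t * x * y for t, x, y in zip(tau_g, xs, ys)) / t_sum - x_bar * y_bar
    s_xx = sum(t * x * x for t, x in zip(tau_g, xs)) / t_sum - x_bar ** 2
    b = s_xy / s_xx
    b0 = y_bar - b * x_bar
    s2_eps = sum(t * (y - (b * x + b0)) ** 2 for t, x, y in zip(tau_g, xs, ys)) / t_sum
    return b, b0, s2_eps

# Noise-free data on the line y = 2x + 1: any positive weights recover it.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
tau_g = [0.2, 0.5, 0.9, 0.4]
b, b0, s2_eps = m_step_regression(xs, ys, tau_g)
```

The intercept is computed from the updated slope, mirroring the order in which the two closed-form expressions depend on each other.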

4. Maximum likelihood estimates of Gaussian CWM and relationships with 
FMR and FMRC 



In this section, we analyse the relationships between maximum likelihood
estimates of Gaussian CWM and those of both FMR and FMRC. We show
that, under suitable hypotheses, maximization of the likelihood
function of Gaussian CWM leads to the same parameter estimates as FMR and
FMRC. In this sense, FMR and FMRC can be considered as nested models of
Gaussian CWM.
4.1. Relationship with FMR



Let us consider the density function of FMR (DeSarbo and Cron, 1988; McLachlan and Peel,
2000; Frühwirth-Schnatter, 2005):

\[
f(y \mid \mathbf{x}; \psi) = \sum_{g=1}^{G} p(y \mid \mathbf{x}, \Omega_g)\, \pi_g = \sum_{g=1}^{G} \phi\big(y \mid \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma^2_{\varepsilon,g}\big)\, \pi_g,
\]

where $\psi$ denotes the overall parameters of the model.

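The corresponding complete-data log-likelihood function is:

\[
\begin{aligned}
\mathcal{L}_c(\psi; \mathbf{X}, \mathbf{y}) &= \sum_{n=1}^{N} \sum_{g=1}^{G} \left( z_{ng} \ln \phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_g) + z_{ng} \ln \pi_g \right) \\
&= \sum_{n=1}^{N} \sum_{g=1}^{G} z_{ng} \ln \phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_g) + \sum_{n=1}^{N} \sum_{g=1}^{G} z_{ng} \ln \pi_g \\
&= \mathcal{L}_{1c}(\boldsymbol{\chi}) + \mathcal{L}_{3c}(\boldsymbol{\pi}).
\end{aligned}
\tag{13}
\]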

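Proposition 1. In model (3), if local densities $\phi_d(\mathbf{x}; \boldsymbol{\psi}_g)$ have the same parameters
$\boldsymbol{\psi}_g = (\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g) = (\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \boldsymbol{\psi}$, that is

\[
\phi_d(\mathbf{x}; \boldsymbol{\psi}_g) = \phi_d(\mathbf{x}; \boldsymbol{\psi}), \qquad g = 1, \ldots, G, \tag{14}
\]

then the maximum likelihood estimate of $(\boldsymbol{\chi}, \boldsymbol{\pi})$ in (13) coincides with the
corresponding estimate in (5).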

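Proof. In order to prove the proposition, it is sufficient to show that, under the
assumption that $\boldsymbol{\psi}_g = (\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g) = (\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \boldsymbol{\psi}$, terms $\mathcal{L}_{1c}(\boldsymbol{\chi})$ and $\mathcal{L}_{3c}(\boldsymbol{\pi})$ in (5)
do not depend on $(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$, $g = 1, \ldots, G$. Indeed, under (14), the complete-data
log-likelihood function becomes:

\[
\begin{aligned}
\mathcal{L}_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) &= \ln L_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) \\
&= \sum_{n=1}^{N} \sum_{g=1}^{G} \left[ z_{ng} \ln \phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_g) + z_{ng} \ln \phi_d(\mathbf{x}_n; \boldsymbol{\psi}) + z_{ng} \ln \pi_g \right] \\
&= \mathcal{L}_{1c}(\boldsymbol{\chi}) + \mathcal{L}^{*}_{2c}(\boldsymbol{\psi}) + \mathcal{L}_{3c}(\boldsymbol{\pi}),
\end{aligned}
\tag{15}
\]

where $\mathcal{L}_{2c}(\boldsymbol{\psi})$ in (7) is now replaced by

\[
\mathcal{L}^{*}_{2c}(\boldsymbol{\psi}) = \sum_{n=1}^{N} \frac{1}{2} \left[ -d \ln 2\pi - \ln |\boldsymbol{\Sigma}| - (\mathbf{x}_n - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}) \right],
\]

since $\sum_{g=1}^{G} z_{ng} = 1$ for $n = 1, \ldots, N$.

Moreover, in the E-step, the posterior probability $\tau_{ng}^{(k)}$ becomes:

\[
\tau_{ng}^{(k)} = \frac{\pi_g^{(k)}\, \phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_g^{(k)})\, \phi_d(\mathbf{x}_n; \boldsymbol{\psi}^{(k)})}{\sum_{j=1}^{G} \pi_j^{(k)}\, \phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_j^{(k)})\, \phi_d(\mathbf{x}_n; \boldsymbol{\psi}^{(k)})} = \frac{\pi_g^{(k)}\, \phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_g^{(k)})}{\sum_{j=1}^{G} \pi_j^{(k)}\, \phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_j^{(k)})},
\]

$n = 1, \ldots, N$ and $g = 1, \ldots, G$.

Then $\tau_{ng}^{(k)}$ does not depend on $\boldsymbol{\psi}$ and coincides with the posterior probability
of FMR, so that the expectations of $\mathcal{L}_{1c}(\boldsymbol{\chi})$ and $\mathcal{L}_{3c}(\boldsymbol{\pi})$ do not depend on $\boldsymbol{\psi}$ either. Thus,
maximization of (15) can be attained by independently maximizing the three terms
$\mathcal{L}_{1c}(\boldsymbol{\chi})$, $\mathcal{L}^{*}_{2c}(\boldsymbol{\psi})$ and $\mathcal{L}_{3c}(\boldsymbol{\pi})$ and hence, maximization of (13) and (15) in the
M-step leads to the same estimates of $(\boldsymbol{\chi}, \boldsymbol{\pi})$. This completes the proof. □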
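The cancellation used in the proof of Proposition 1 can be verified numerically: when every group shares the same local density of the input, that factor drops out of the posterior probabilities. The sketch below assumes a scalar input ($d = 1$); function names, dictionary keys and parameter values are illustrative assumptions.

```python
import math

def normal_pdf(v, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def cwm_posterior(x, y, groups):
    """Posterior group probabilities under the linear Gaussian CWM (scalar x)."""
    w = [g["pi"] * normal_pdf(y, g["b"] * x + g["b0"], g["s2_eps"])
         * normal_pdf(x, g["mu"], g["s2"]) for g in groups]
    total = sum(w)
    return [v / total for v in w]

def fmr_posterior(x, y, groups):
    """Posterior group probabilities under FMR: no local density of x."""
    w = [g["pi"] * normal_pdf(y, g["b"] * x + g["b0"], g["s2_eps"]) for g in groups]
    total = sum(w)
    return [v / total for v in w]

# Both groups share the same (mu, s2), as required by the proposition.
groups = [
    {"b": 1.0, "b0": 0.0, "s2_eps": 0.5, "mu": 0.0, "s2": 2.0, "pi": 0.3},
    {"b": -0.5, "b0": 3.0, "s2_eps": 1.0, "mu": 0.0, "s2": 2.0, "pi": 0.7},
]
p_cwm = cwm_posterior(0.8, 1.5, groups)
p_fmr = fmr_posterior(0.8, 1.5, groups)
```

Since the E-steps coincide and the regression and mixing-weight updates depend on the data only through these posteriors, the two EM iterations produce the same values for those parameters when started from the same point.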

4.2. Relationship with FMRC 

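Let us consider the density function of FMRC (e.g. Dayton and Macready,
1988):

\[
f^{*}(y \mid \mathbf{x}; \psi^{*}) = \sum_{g=1}^{G} \phi\big(y \mid \mathbf{b}_g'\mathbf{x} + b_{g0}, \sigma^2_{\varepsilon,g}\big)\, p(\Omega_g \mid \mathbf{x}, \boldsymbol{\xi}), \tag{16}
\]

where the mixing weight $p(\Omega_g \mid \mathbf{x}, \boldsymbol{\xi})$ is now a function depending on $\mathbf{x}$ through
some parameters $\boldsymbol{\xi}$ and $\psi^{*}$ is the augmented set of all parameters of the model.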

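Probability $p(\Omega_g \mid \mathbf{x}, \boldsymbol{\xi})$ is usually modeled by a multinomial logistic
distribution with the first component as baseline, that is:

\[
p(\Omega_g \mid \mathbf{x}, \boldsymbol{\xi}) = \frac{\exp(\mathbf{w}_g'\mathbf{x} + w_{g0})}{\sum_{j=1}^{G} \exp(\mathbf{w}_j'\mathbf{x} + w_{j0})}. \tag{17}
\]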



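In particular, equation (17) is satisfied if local densities $p(\mathbf{x} \mid \Omega_g)$, $g = 1, \ldots, G$,
are assumed to be Gaussian with the same covariance matrices (Anderson, 1972).
The complete-data log-likelihood function corresponding to (16) is:

\[
\begin{aligned}
\mathcal{L}_c(\psi^{*}; \mathbf{X}, \mathbf{y}) &= \sum_{n=1}^{N} \sum_{g=1}^{G} z_{ng} \left[ \ln \phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_g) + \ln p(\Omega_g \mid \mathbf{x}_n, \boldsymbol{\xi}) \right] \\
&= \mathcal{L}_{1c}(\boldsymbol{\chi}) + \mathcal{L}_{3c}(\boldsymbol{\xi}).
\end{aligned}
\tag{18}
\]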

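Proposition 2. In model (3), if local densities $\phi_d(\mathbf{x}; \boldsymbol{\psi}_g)$ have the same covariance
matrices $\boldsymbol{\Sigma}_g = \boldsymbol{\Sigma}$, $g = 1, \ldots, G$, and equal prior probabilities $\pi_g = 1/G$, then
the maximum likelihood estimate of $(\boldsymbol{\chi}, \boldsymbol{\xi})$ in (18) can be derived from the estimate
of $(\boldsymbol{\chi}, \boldsymbol{\psi})$ in (5).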
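Proof. In order to prove the proposition, it is sufficient to show that, under
assumptions $\boldsymbol{\Sigma}_g = \boldsymbol{\Sigma}$ and $\pi_g = \pi = 1/G$, $g = 1, \ldots, G$, terms $\mathcal{L}_{1c}(\boldsymbol{\chi})$ and $\mathcal{L}_{3c}(\boldsymbol{\pi})$ in
(5) do not depend on $(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$, $g = 1, \ldots, G$. Indeed, we have:

\[
L_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) = \prod_{n=1}^{N} \prod_{g=1}^{G} \left[\phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_g)\right]^{z_{ng}} \left[\phi_d(\mathbf{x}_n; \boldsymbol{\mu}_g, \boldsymbol{\Sigma})\right]^{z_{ng}} \pi^{z_{ng}} \tag{19}
\]

and taking the logarithm of (19), after some algebra we get

\[
\begin{aligned}
\mathcal{L}_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) &= \ln L_c(\boldsymbol{\theta}; \mathbf{X}, \mathbf{y}) \\
&= \sum_{n=1}^{N} \sum_{g=1}^{G} \left[ z_{ng} \ln \phi(y_n \mid \mathbf{x}_n; \boldsymbol{\chi}_g) + z_{ng} \ln \phi_d(\mathbf{x}_n; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}) + z_{ng} \ln \pi \right] \\
&= \mathcal{L}_{1c}(\boldsymbol{\chi}) + \mathcal{L}^{**}_{2c}(\boldsymbol{\psi}) + N \ln \pi,
\end{aligned}
\tag{20}
\]

where $\mathcal{L}_{2c}(\boldsymbol{\psi})$ in (7) is now replaced by

\[
\mathcal{L}^{**}_{2c}(\boldsymbol{\psi}) = \frac{1}{2} \sum_{n=1}^{N} \sum_{g=1}^{G} z_{ng} \left[ -d \ln 2\pi - \ln |\boldsymbol{\Sigma}| - (\mathbf{x}_n - \boldsymbol{\mu}_g)' \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_g) \right].
\]

Once the estimates of $(\boldsymbol{\mu}_g, \boldsymbol{\Sigma})$ have been obtained, quantity $p(\Omega_g \mid \mathbf{x}, \boldsymbol{\xi})$ in (18) can
be obtained immediately, that is:

\[
p(\Omega_g \mid \mathbf{x}, \boldsymbol{\xi}) = \frac{\phi_d(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma})\, \pi}{\sum_{j=1}^{G} \phi_d(\mathbf{x}; \boldsymbol{\mu}_j, \boldsymbol{\Sigma})\, \pi} = \frac{\exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_g)' \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_g) \right]}{\sum_{j=1}^{G} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_j)' \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_j) \right]},
\]

which can be written in form (17) for suitable constants $\mathbf{w}_g$, $w_{g0}$, $g = 1, \ldots, G$.
This completes the proof. □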
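One choice of the constants in the final step of the proof follows from expanding the quadratic forms and dividing by the first component: $\mathbf{w}_g = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_g - \boldsymbol{\mu}_1)$ and $w_{g0} = -\frac{1}{2}(\boldsymbol{\mu}_g'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_g - \boldsymbol{\mu}_1'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1)$, so that the baseline has zero coefficients. These expressions are a standard identity supplied here for illustration rather than taken from the paper. The sketch below checks the scalar case ($d = 1$); all function names and numeric values are hypothetical.

```python
import math

def gaussian_weights(x, mus, s2):
    """Mixing weights from Gaussian local densities with common variance s2
    and equal prior probabilities (scalar x)."""
    kernels = [math.exp(-0.5 * (x - mu) ** 2 / s2) for mu in mus]
    total = sum(kernels)
    return [k / total for k in kernels]

def logistic_coefficients(mus, s2):
    """Slope and intercept (w_g, w_g0) of the equivalent multinomial logistic
    model, taking the first component as baseline."""
    mu1 = mus[0]
    return [((mu - mu1) / s2, -(mu ** 2 - mu1 ** 2) / (2.0 * s2)) for mu in mus]

def logistic_weights(x, coefs):
    """Multinomial logistic weights for scalar x."""
    scores = [math.exp(w * x + w0) for w, w0 in coefs]
    total = sum(scores)
    return [s / total for s in scores]

mus = [-1.0, 0.5, 2.0]  # made-up group means
s2 = 1.3                # made-up common variance
coefs = logistic_coefficients(mus, s2)
p_gauss = gaussian_weights(0.7, mus, s2)
p_logit = logistic_weights(0.7, coefs)
```

The quadratic term in $x$ is common to all groups only because the variance is shared; with group-specific variances it would not cancel and the weights would no longer be linear-logistic in $x$.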

5. Conclusions 

In this paper, we presented an analysis of the complete-data likelihood func- 
tion of Gaussian CWM and derived the parameter estimates according to the EM 
algorithm. Afterwards, theoretical results showed that, under suitable assump- 
tions, both FMR and FMRC are nested models of Gaussian CWM. This implies 
that CWM is a quite general framework for local statistical modeling. 
Appendix



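From equation (10), for $b_{g0}^{(k+1)}$ ($g = 1, \ldots, G$) we obtain:

\[
\sum_{n=1}^{N} \tau_{ng}^{(k)} \frac{\partial Q_1(\boldsymbol{\chi}_g; \boldsymbol{\theta}^{(k)})}{\partial b_{g0}} = 0,
\]

yielding

\[
\sum_{n=1}^{N} \tau_{ng}^{(k)} y_n - \mathbf{b}_g' \sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n - b_{g0} \sum_{n=1}^{N} \tau_{ng}^{(k)} = 0,
\]

and then we get

\[
b_{g0}^{(k+1)} = \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} y_n}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} - \mathbf{b}_g^{(k+1)\prime}\, \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n}{\sum_{n=1}^{N} \tau_{ng}^{(k)}}.
\]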



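For $\mathbf{b}_g^{(k+1)}$ ($g = 1, \ldots, G$), equation (11) leads to:

\[
\sum_{n=1}^{N} \tau_{ng}^{(k)} \frac{\partial Q_1(\boldsymbol{\chi}_g; \boldsymbol{\theta}^{(k)})}{\partial \mathbf{b}_g'} = \mathbf{0}',
\]

which implies

\[
\sum_{n=1}^{N} \tau_{ng}^{(k)} \left[ y_n - (\mathbf{b}_g'\mathbf{x}_n + b_{g0}) \right] \mathbf{x}_n' = \mathbf{0}',
\]

yielding, after substituting the expression of $b_{g0}$ obtained above,

\[
\sum_{n=1}^{N} \tau_{ng}^{(k)} y_n \mathbf{x}_n' - \mathbf{b}_g' \sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n \mathbf{x}_n' - \left( \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} y_n}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} - \mathbf{b}_g' \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} \right) \sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n' = \mathbf{0}',
\]

and finally

\[
\mathbf{b}_g^{(k+1)\prime} = \left( \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} y_n \mathbf{x}_n'}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} - \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} y_n}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} \cdot \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n'}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} \right) \left( \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n \mathbf{x}_n'}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} - \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} \cdot \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \mathbf{x}_n'}{\sum_{n=1}^{N} \tau_{ng}^{(k)}} \right)^{-1}.
\]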
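Furthermore, equation (12) leads to the current estimate of the variance $\sigma^2_{\varepsilon,g}$
($g = 1, \ldots, G$):

\[
\sum_{n=1}^{N} \tau_{ng}^{(k)} \frac{\partial Q_1(\boldsymbol{\chi}_g; \boldsymbol{\theta}^{(k)})}{\partial \sigma^2_{\varepsilon,g}} = 0,
\]

leading to

\[
\sum_{n=1}^{N} \tau_{ng}^{(k)} \left\{ -\frac{1}{\sigma^2_{\varepsilon,g}} + \frac{\left[ y_n - (\mathbf{b}_g'\mathbf{x}_n + b_{g0}) \right]^2}{\sigma^4_{\varepsilon,g}} \right\} = 0,
\]

and furthermore

\[
\sigma_{\varepsilon,g}^{2(k+1)} = \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \left[ y_n - \big(\mathbf{b}_g^{(k+1)\prime}\mathbf{x}_n + b_{g0}^{(k+1)}\big) \right]^2}{\sum_{n=1}^{N} \tau_{ng}^{(k)}}. \tag{23}
\]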



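Finally, we remark that, in the general case of a possibly nonlinear function $\mu(\mathbf{x}; \boldsymbol{\beta}_g)$, the equations for $b_{g0}$ and $\mathbf{b}_g$ are replaced by

\[
\sum_{n=1}^{N} \tau_{ng}^{(k)} \left[ y_n - \mu(\mathbf{x}_n; \boldsymbol{\beta}_g) \right] \frac{\partial \mu(\mathbf{x}_n; \boldsymbol{\beta}_g)}{\partial \boldsymbol{\beta}_g'} = \mathbf{0}',
\]

and (23) is replaced by

\[
\sigma_{\varepsilon,g}^{2(k+1)} = \frac{\sum_{n=1}^{N} \tau_{ng}^{(k)} \left[ y_n - \mu\big(\mathbf{x}_n; \boldsymbol{\beta}_g^{(k+1)}\big) \right]^2}{\sum_{n=1}^{N} \tau_{ng}^{(k)}}.
\]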